release-23.1: server: make the span stats fan-out more fault tolerant #109008
Since this is a backport, the title needs to say so, e.g. release-23.1: server: make...
You also need to add Release justification: to the PR description.
Reviewed 13 of 13 files at r1, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @dhartunian)
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @dhartunian and @zachlite)
pkg/server/status.go
line 2851 at r1 (raw file):
} const noTimeout time.Duration = 0
On master this const is defined in the file pagination.go. Does that file not exist on this branch?
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @dhartunian and @maryliag)
pkg/server/status.go
line 2851 at r1 (raw file):
Previously, maryliag (Marylia Gutierrez) wrote…
On master this const is defined in the file pagination.go. Does that file not exist on this branch?
I'm working on a separate backport for #107796.
We can wait and rebase this PR on top of that upcoming backport.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @dhartunian and @zachlite)
pkg/server/status.go
line 2851 at r1 (raw file):
Previously, zachlite wrote…
I'm working on a separate backport for #107796.
We can wait and rebase this PR on top of that upcoming backport.
Sounds better! Let's get that one in first! Thank you.
Will rebase on top of #109015.
f137ca2 to d266fc3
@maryliag - rebase is done.
Reviewed 2 of 2 files at r2, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @dhartunian)
Will rebase on top of the latest commits on |
This commit adds improved fault tolerance to the span stats fan-out:

1. Errors encountered during the fan-out will not invalidate the entire request. Now, a span stats fan-out will always return a roachpb.SpanStatsResponse that has been updated by values from nodes that service their requests without error. In the extreme case where there's a failure encountered on every node, an empty response is returned. Errors that are encountered are logged, and then appended to the response in the newly added `Errors` field.

2. Nodes must service requests within the timeout passed to `iterateNodes`. For span stats, the value comes from a new cluster setting: 'server.span_stats.node.timeout', with a default value of 1 minute.

Resolves cockroachdb#106097

Epic: none

Release note (ops change): Span stats requests will return a partial result if the request encounters any errors. Errors that would have previously terminated the request are now included in the response.
d266fc3 to 55ac3ec
Backport 1/1 commits from #108456
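This commit adds improved fault tolerance to the span stats fan-out:

1. Errors encountered during the fan-out will not invalidate the entire request. Now, a span stats fan-out will always return a roachpb.SpanStatsResponse that has been updated by values from nodes that service their requests without error. In the extreme case where there's a failure encountered on every node, an empty response is returned. Errors that are encountered are logged, and then appended to the response in the newly added `Errors` field.
2. Nodes must service requests within the timeout passed to `iterateNodes`. For span stats, the value comes from a new cluster setting: 'server.span_stats.node.timeout', with a default value of 1 minute.

Resolves #106097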
Epic: none
Release note (ops change): Span stats requests will return a partial result if the request encounters any errors. Errors that would have previously terminated the request are now included in the response.
Release justification: low risk fault tolerance improvements
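
For readers skimming the thread, the partial-result behavior described in the commit message can be illustrated with a minimal, standalone Go sketch. It is not the code from pkg/server/status.go: `nodeID`, `spanStatsResponse`, `nodeFn`, `fanOut`, `fakeNode`, and `runDemo` are hypothetical names invented for the illustration, and treating a zero timeout as "no timeout" is an assumption loosely based on the `noTimeout` const quoted in the review. The idea shown is only that each node gets its own deadline, and that a failing or slow node contributes an entry to `Errors` instead of failing the whole request.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sort"
	"sync"
	"time"
)

// nodeID and spanStatsResponse are hypothetical stand-ins for illustration;
// they are not the CockroachDB types.
type nodeID int

type spanStatsResponse struct {
	RangeCount int      // summed over nodes that answered without error
	Errors     []string // one entry per node that failed or timed out
}

// nodeFn is a hypothetical per-node call returning that node's range count.
type nodeFn func(ctx context.Context, id nodeID) (int, error)

// fanOut calls fn on every node concurrently. A node error never aborts the
// request: it is recorded in Errors and the other nodes' values are kept.
// Assumption: timeout == 0 means "no per-node timeout".
func fanOut(ctx context.Context, nodes []nodeID, timeout time.Duration, fn nodeFn) spanStatsResponse {
	var (
		mu   sync.Mutex
		wg   sync.WaitGroup
		resp spanStatsResponse
	)
	for _, id := range nodes {
		wg.Add(1)
		go func(id nodeID) {
			defer wg.Done()
			nodeCtx := ctx
			if timeout > 0 {
				var cancel context.CancelFunc
				nodeCtx, cancel = context.WithTimeout(ctx, timeout)
				defer cancel()
			}
			n, err := fn(nodeCtx, id)
			mu.Lock()
			defer mu.Unlock()
			if err != nil {
				resp.Errors = append(resp.Errors, fmt.Sprintf("n%d: %v", id, err))
				return
			}
			resp.RangeCount += n
		}(id)
	}
	wg.Wait()
	sort.Strings(resp.Errors) // deterministic order for display
	return resp
}

// fakeNode simulates one failing node (2), one node that never answers
// before its deadline (3), and healthy nodes that each report 10 ranges.
func fakeNode(ctx context.Context, id nodeID) (int, error) {
	switch id {
	case 2:
		return 0, errors.New("node unavailable")
	case 3:
		<-ctx.Done() // block until the per-node deadline fires
		return 0, ctx.Err()
	default:
		return 10, nil
	}
}

// runDemo fans out to four fake nodes with a 50ms per-node timeout.
func runDemo() spanStatsResponse {
	return fanOut(context.Background(), []nodeID{1, 2, 3, 4}, 50*time.Millisecond, fakeNode)
}

func main() {
	resp := runDemo()
	fmt.Printf("ranges=%d errors=%v\n", resp.RangeCount, resp.Errors)
}
```

In this sketch the two healthy nodes still contribute their counts (20 in total), while the unavailable node and the timed-out node each show up as one entry in `Errors`. If every node failed, the result would be an empty response carrying only errors, which mirrors the extreme case the commit message mentions.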